Abstract

This paper explores the automatic construction of a multilingual
Lexical Knowledge Base from preexisting lexical resources. First, a
set of automatic and complementary techniques for linking Spanish
words collected from monolingual and bilingual MRDs to English WordNet
synsets is described. Second, we show how the data provided by each
method are combined to produce a preliminary version of a Spanish
WordNet with an accuracy over 85%. Applying these combinations
increases the number of extracted connections by 40% without losing
accuracy. Both coarse-grained (class level) and fine-grained (synset
assignment level) confidence ratios are used and evaluated. Finally,
the results for the whole process are presented.



1 Introduction 

There is no doubt about the increasing importance of using
wide-coverage ontologies for NLP tasks. Although available ontologies
(Upper Model (Bateman 90), CYC (Lenat 95), WordNet (Miller 90), ONTOS
(Nirenburg & Defrise 93), Mikrokosmos, EDR (Yokoi 95), etc.)^ differ
greatly in several characteristics (e.g. broad coverage vs. domain
specific, lexically oriented vs. conceptually oriented, granularity,
kind of information placed in nodes, kind of relations, way of
building, etc.), it is clear that WordNet has become a de facto
standard for a wide range of NL applications. Developed at Princeton
by George Miller and his research group (Miller 90), the currently
available version, WordNet 1.5 (WN1.5), shows impressive figures
(119,217 words, 91,587 synsets). WN1.5 is organised as a network of
lexicalized concepts (synsets), which are sets of word meanings (WMs)
considered to be synonymous within a context. Synsets are connected by
several semantic relations (nevertheless, only hypernymy-hyponymy is
considered in this work).

This research has been partially funded by the Spanish Research
Department (ITEM project TIC96-1243-C03-03), the Catalan Research
Department (CIRIT 1995SGR 00566) and the EU Commission (EuroWordNet
LE4003).

^See an overview and discussion of the CYC, WordNet and EDR systems in
Communications of the ACM 38(11), pages 33-48, 1995.

WordNet's success has encouraged several projects to build WordNets
(WNs) for other languages or to develop multilingual WNs. The most
ambitious of these efforts is EuroWordNet (EWN)^, a project aiming to
build a multilingual WordNet for several European languages^. The work
we present here is part of EWN and presents our approach for
(semi)automatically building a Spanish WN (Climent et al. 96). The
main strategy of our approach is to map Spanish words to WN1.5
synsets, creating for Spanish a network parallel in structure.
Therefore, our main goal is to attach Spanish word meanings to the
existing WN1.5 concepts. This paper describes the automatic techniques
that have been developed to achieve this goal for nouns.

Recently, several attempts have been made to produce multilingual
ontologies automatically. (Ageno et al. 94) link taxonomic structures
derived from DGILE and LDOCE by means of a bilingual dictionary.
(Knight & Luk 94) focus on the construction of Sensus, a large
knowledge base for supporting the Pangloss Machine Translation system,
merging ontologies (ONTOS and UpperModel) and WordNet with monolingual
and bilingual dictionaries. (Okumura & Hovy 94) describe a
(semi)automatic method for associating a Japanese lexicon to an
ontology using a Japanese/English bilingual dictionary as a "bridge".
(Rigau et al. 95) link Spanish word senses to WordNet synsets, also
using a bilingual dictionary. (Rigau & Agirre 95) exploit several
bilingual dictionaries for linking Spanish and French words to WordNet
synsets.

^EuroWordNet: Project LE-4003 of the EU.

^Initially three languages, apart from English, were involved: Dutch,
Italian and Spanish. The project has recently been extended to cover
French and German.

Our approach for building the Spanish WN (SpWN) is based on the
following considerations:

• The close conceptual similarity of English and Spanish allows for
the preservation of the structure of WN1.5 in order to build the
SpWN. Moreover, when necessary, lexicalization mismatches are solved
using multi-word translations (collocations) supplied by bilingual
dictionaries.

• Pre-existing structured lexical sources are used extensively in
order to achieve a massive automatic acquisition process.

• The accuracy of cross-language mappings is validated by hand on a
sample. Each attachment to WN bears a confidence score (CS).

• Only attachments over a threshold are considered. Moreover, a manual
inspection of attachments in a given range will be carried out.

Undoubtedly, following this approach, most of the criticisms levelled
at WN1.5 also apply to SpWN: excessively fine-grained senses, lack of
cross-POS relationships, simplicity of the relational information,
being a lexical-conceptual rather than a purely lexical database, etc.
Despite these drawbacks, WN1.5 is widely used and tested, and it
supports only a few semantic relations, but the most basic ones. Our
approach ensures that most of the huge networking effort that is
necessary to build a WN from scratch is already done.

The different sources involved in the process differ in accuracy. High
CSs can be assigned to original sources, such as MRDs, but derived
sources, which result from automatic procedures, bear substantially
lower CSs. Our major claim is that when multiple sources or procedures
lead to the same result, the CS of that result increases, whereas when
they lead to different results, the overall CS decreases.

This paper is organized as follows. Section 2 presents the Lexical
Knowledge resources used. Section 3 describes the different types of
extraction/mapping methods developed. The main results and quality
assessment issues are presented in Section 4. Section 5 presents some
final remarks.

2 Lexical Knowledge Sources 

Several lexical sources have been applied in order to assign Spanish
WMs to WN1.5 synsets:

1. Spanish/English and English/Spanish bilingual dictionaries^

2. A large Spanish monolingual dictionary^

3. English WordNet (WN1.5).

By merging both directions of the bilingual dictionaries we have
obtained what we call the homogeneous bilingual (HBil). Because of its
small size, the maximum synset coverage we can expect to reach by
using HBil is 32%. Table 1 summarises the amount of data.
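As an illustration of the merging step, the following minimal Python
sketch builds a homogeneous bilingual from two direction-specific
dictionaries. The function names (`merge_bilinguals`, `index_pairs`)
and the toy entries are assumptions made for the example; they are not
taken from the resources described above.

```python
from collections import defaultdict


def merge_bilinguals(es_en, en_es):
    """Merge both dictionary directions into one set of (SW, EW) pairs."""
    pairs = set()
    for sw, ews in es_en.items():
        for ew in ews:
            pairs.add((sw, ew))
    for ew, sws in en_es.items():
        for sw in sws:
            pairs.add((sw, ew))
    return pairs


def index_pairs(pairs):
    """Index the merged pairs in both directions for later lookups."""
    sw2ew, ew2sw = defaultdict(set), defaultdict(set)
    for sw, ew in pairs:
        sw2ew[sw].add(ew)
        ew2sw[ew].add(sw)
    return sw2ew, ew2sw


# Toy data (hypothetical entries, for illustration only).
es_en = {"casa": ["house"], "hogar": ["home"]}
en_es = {"house": ["casa", "hogar"]}
hbil = merge_bilinguals(es_en, en_es)
sw2ew, ew2sw = index_pairs(hbil)
```

A pair found in only one direction (here hogar/house) still ends up in
the merged resource, which is why HBil has more connections than either
direction alone.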

3 Methods 

Bilingual entries must be disambiguated against WN. The procedures
developed for linking Spanish lexical entries to WN synsets can be
classified into three main groups according to the kind of knowledge
sources involved in the process:

• Class methods: use as knowledge sources individual entries coming
from bilinguals and WN synsets.

• Structural methods: take advantage of the WN structure.

• Conceptual Distance methods: make use of knowledge about the
closeness in meaning between lexical concepts.

Every method has been manually inspected in order to measure its CS.
These tests have been performed on a random sample of 10% using the
Validation Interface (VI), an environment designed to allow hand
validation of the assignment of Spanish word forms to WN synsets. It
allows the user to consult and navigate through the monolingual and
bilingual lexical databases and WN. The validation can yield the
following diagnostics:

ok : correct links.

ko : fully incorrect links.

hypo : links to a hyponym of the correct synset.

hyper : links to a hypernym of the correct synset.

near : links to near synonyms that could be considered ok.

                            English nouns  Spanish nouns  Synsets  Connections
WordNet1.5                         87,642              -   60,557      107,424
Spanish/English                    11,467         12,370        -       19,443
English/Spanish                    10,739         10,549        -       16,324
HBil                               15,848         14,880        -       28,131
Maximum Reachable Coverage         12,665         13,208   19,383       66,258
  of WordNet                          14%              -      32%            -
  of bilingual                        80%            90%        -            -

Table 1: Dictionary Statistics

^Connections can be word/word or word/synset. When there are synsets
involved the connections are Spanish-word/synset (except for WordNet
itself), otherwise Spanish-word/English-word.

^Diccionario Vox/Harraps Esencial Español/Inglés - Inglés/Español.
Biblograf S.A. Barcelona 1992.

^DGILE: Diccionario General Ilustrado de la Lengua Española - Vox -
M. Alvar (ed). Biblograf S.A. Barcelona 1987.

3.1 Class Methods 

Following the properties described in (Rigau & Agirre 95), HBil has
been processed and two groups of four different cases have been
collected, depending on whether the English words are monosemous or
polysemous relative to WN1.5. Two hybrid criteria are considered as
well.

3.1.1 Monosemic Criteria 

These criteria apply only to English words (EWs) that are monosemous
with respect to WN1.5. As a result, the unique synset of the EW is
linked to the corresponding Spanish words.

• Monosemic-1 criterion (1:1):

A Spanish Word (SW) has only one English translation (EW);
symmetrically, EW has SW as its unique translation.

[Figure 1: Monosemic-1 Criterion (one SW connected to one EW)]

• Monosemic-2 criterion (1:N with N>1):

A SW has more than one translation; each EW has SW as its unique
translation.

[Figure 2: Monosemic-2 Criterion (one SW connected to several EWs)]

• Monosemic-3 criterion (M:1 with M>1):

Several SWs have the same translation; EW has several translations to
Spanish.

[Figure 3: Monosemic-3 Criterion (several SWs connected to one EW)]

• Monosemic-4 criterion (M:N with M>1 & N>1):

Several SWs have different translations; EWs also have several
translations.

[Figure 4: Monosemic-4 Criterion (several SWs connected to several EWs)]
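The four cardinality cases can be illustrated with a short Python
sketch. It is a simplification under stated assumptions: each (SW, EW)
pair is classified on its own from the number of translations of SW and
of EW, whereas the criteria above are stated over whole groups of
translations. The names `classify_pairs` and `synsets_of`, and the
synset identifiers, are hypothetical.

```python
from collections import defaultdict


def classify_pairs(pairs, synsets_of):
    """Label each (SW, EW) pair as mono1..mono4 or poly1..poly4.

    pairs: set of (Spanish word, English word) translation pairs.
    synsets_of: maps an English word to its list of synset identifiers.
    Assumption: the case is decided per pair from translation counts.
    """
    sw2ew, ew2sw = defaultdict(set), defaultdict(set)
    for sw, ew in pairs:
        sw2ew[sw].add(ew)
        ew2sw[ew].add(sw)

    labels = {}
    for sw, ew in pairs:
        n = len(sw2ew[sw])  # translations of the Spanish word
        m = len(ew2sw[ew])  # translations of the English word
        if n == 1 and m == 1:
            case = 1        # 1:1
        elif n > 1 and m == 1:
            case = 2        # 1:N
        elif n == 1 and m > 1:
            case = 3        # M:1
        else:
            case = 4        # M:N
        senses = len(synsets_of.get(ew, ()))
        if senses == 0:
            continue        # EW not in the sense inventory: no link
        family = "mono" if senses == 1 else "poly"
        labels[(sw, ew)] = f"{family}{case}"
    return labels


# Toy data (hypothetical entries and synset identifiers).
pairs = {("abeto", "fir"), ("banco", "bank"), ("banco", "bench"),
         ("casa", "house"), ("hogar", "house")}
synsets_of = {"fir": ["fir.1"], "bank": ["bank.1", "bank.2"],
              "bench": ["bench.1"], "house": ["house.1"]}
labels = classify_pairs(pairs, synsets_of)
```

For the monosemous cases the single synset of EW would then be linked
to SW; for the polysemous cases every synset of EW is a candidate,
which is consistent with the lower precision reported for those
criteria.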

3.1.2 Polysemic Criteria 

These criteria follow the four criteria described in the previous
subsection, but for polysemous English words (relative to WN1.5).

3.1.3 Hybrid Criteria 

• Variant criterion

For a WN1.5 synset which contains a set of variants EWs, if two or
more of the variants EWi have only one translation, and this
translation is the same Spanish word SW, a link is produced from SW to
the WN1.5 synset.

• Field criterion

This procedure makes use of the existence of a field identifier in
some entries (over 4,000) of the English/Spanish bilingual. For each
English entry (EW) bearing a field identifier, if the EW and its field
identifier both occur in the same synset, a link to that synset is
produced for each Spanish translation of EW.

Results of the manual verification for each criterion are shown in
Table 2.
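The Variant criterion described above can be sketched as follows. This
is a minimal illustration, assuming a synset is given as a list of its
English variants and that `ew2sw` maps each English word to the set of
its Spanish translations; both structures and all identifiers are
hypothetical.

```python
from collections import Counter


def variant_links(synsets, ew2sw):
    """Return (SW, synset) links supported by two or more variants.

    A variant supports SW only if SW is its single translation.
    """
    links = set()
    for synset_id, variants in synsets.items():
        votes = Counter()
        for ew in variants:
            translations = ew2sw.get(ew, set())
            if len(translations) == 1:
                votes[next(iter(translations))] += 1
        for sw, count in votes.items():
            if count >= 2:
                links.add((sw, synset_id))
    return links


# Toy data (hypothetical synsets and translations).
synsets = {"s1": ["car", "auto", "machine"], "s2": ["machine"]}
ew2sw = {"car": {"coche"}, "auto": {"coche"},
         "machine": {"máquina", "coche"}}
links = variant_links(synsets, ew2sw)
```

In the toy data only synset s1 receives a link: "machine" has two
translations, so it never counts as supporting evidence.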

3.2 Structural Methods 

In this set of methods the whole WN1.5 structure has been used for
disambiguation. From HBil, all combinations of English words, from two
up to the maximum number of translations of each entry, have been
generated. The idea is to find as much common information as possible
among the corresponding EWs in WN1.5. Four experiments have been
carried out on the extracted combinations, resulting in the criteria
described below:

• Intersection criterion

Conditions: All EWs share at least one common synset in WordNet.
Link: SW is linked to all common synsets of its translations.

• Parent criterion

Conditions: A synset of an EW is the direct parent of synsets
corresponding to the rest of the EWs. Link: SW is linked to all
hyponym synsets^.

• Brother criterion

Conditions: All EWs have synsets which are brothers with respect to a
common parent. Link: SW is linked to all co-hyponym synsets.

• Distant hypernymy criterion

Conditions: A synset of an EW is a distant hypernym of synsets of the
rest of the EWs. Link: SW is linked to the lower-level (hyponym)
synsets.
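Two of the structural criteria above, Intersection and Parent, can be
sketched in Python as follows. The sketch assumes a toy hierarchy in
which every synset has at most one parent (`parent_of`), which is a
simplification of WN1.5; the function names and synset identifiers are
hypothetical.

```python
def intersection_links(ews, synsets_of):
    """Synsets shared by all English translations of one Spanish word."""
    sets = [set(synsets_of.get(ew, ())) for ew in ews]
    return set.intersection(*sets) if sets else set()


def parent_links(ews, synsets_of, parent_of):
    """Hyponym synsets whose direct parent is a synset of another EW.

    The condition holds when one EW has a synset p such that every
    other EW has at least one synset whose direct parent is p.
    """
    links = set()
    for head in ews:
        rest = [ew for ew in ews if ew != head]
        for p in synsets_of.get(head, ()):
            children_per_ew = [
                {s for s in synsets_of.get(ew, ()) if parent_of.get(s) == p}
                for ew in rest
            ]
            if rest and all(children_per_ew):
                links.update(*children_per_ew)
    return links


# Toy data (hypothetical synset identifiers).
synsets_of = {"car": ["car.1", "car.2"], "auto": ["car.1"],
              "vehicle": ["vehicle.1"]}
parent_of = {"car.1": "vehicle.1", "car.2": "cabin.1"}

shared = intersection_links(["car", "auto"], synsets_of)
hyponyms = parent_links(["vehicle", "car"], synsets_of, parent_of)
```

The Brother and Distant hypernymy criteria follow the same pattern,
replacing the direct-parent test with a shared-parent test or an
ancestor test respectively.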

As the results of all these criteria follow a structure like:

Spanish-Word <list-of-EW> <list-of-synsets>

the results of the Structural Criteria have been subsequently pruned
by deleting repeated entries subsumed by larger ones.
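The pruning step can be sketched as follows. The text does not define
subsumption precisely, so the sketch assumes that an entry is subsumed
when another entry for the same Spanish word contains all of its
English words and all of its synsets; the name `prune_subsumed` and
the toy entries are hypothetical.

```python
def prune_subsumed(entries):
    """Drop duplicates and entries subsumed by a larger entry.

    entries: list of (SW, frozenset of EWs, frozenset of synsets).
    Assumption: subsumption means same SW, EWs and synsets both
    contained in those of another, different entry.
    """
    unique = list(dict.fromkeys(entries))  # remove exact repeats, keep order
    kept = []
    for sw, ews, synsets in unique:
        subsumed = any(
            sw == sw2
            and (ews, synsets) != (ews2, synsets2)
            and ews <= ews2
            and synsets <= synsets2
            for sw2, ews2, synsets2 in unique
        )
        if not subsumed:
            kept.append((sw, ews, synsets))
    return kept


# Toy data (hypothetical entries).
small = ("coche", frozenset({"car", "auto"}), frozenset({"s1"}))
large = ("coche", frozenset({"car", "auto", "machine"}), frozenset({"s1"}))
other = ("casa", frozenset({"house", "home"}), frozenset({"s7"}))
pruned = prune_subsumed([small, large, small, other])
```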

The overall results of the Structural criteria are shown in Table 3.

^A previous experiment assigning SW only to the hypernym synset
(assuming this would appropriately capture global information)
resulted in too general assignments.



A finer-grained experiment has been performed on the size of the
translation list. We have found that the larger this size is, the
higher the precision obtained and, even more important, the lower the
KO ratio. The results for the Intersection criterion are shown in
Table 4.



#Words   %OK     %KO    %HYPO
2        81.39   3.48   1.51
3        91.89   0.0    5.4
4        94.4    0.0    0.0

Table 4: Results for the Intersection Criterion

3.3 Conceptual Distance Methods 

Taking as reference a structured hierarchical net, conceptual distance
tries to provide a basis for determining closeness in meaning among
words. Conceptual distance between two concepts is defined in (Rada et
al. 89) as the length of the shortest path that connects the concepts
in a hierarchical semantic net. In a similar approach, (Sussna 93)
employs the notion of conceptual distance between network nodes in
order to improve precision during document indexing. Following these
ideas, (Agirre et al. 94) describe a new conceptual distance formula
for automatic spelling correction and (Rigau 94), using this
conceptual distance formula, presents a methodology to enrich
dictionary senses with semantic tags extracted from WordNet. The same
measure is used in (Rigau et al. 95) for linking taxonomies extracted
from DGILE and LDOCE and in (Rigau et al.) as one of the methods for
the Genus Sense Disambiguation problem in DGILE. Conceptual density, a
more complex semantic measure among words, is defined in (Agirre &
Rigau 95) and used in (Agirre & Rigau 96) as a proposal for WSD of the
Brown Corpus. The Conceptual Distance formula used in this work, also
described in (Agirre et al. 94), is shown in Figure 5.



\[
dist(w_1, w_2) = \min_{\substack{c_{1i} \in w_1 \\ c_{2j} \in w_2}}
  \sum_{c_k \in path(c_{1i}, c_{2j})} \frac{1}{depth(c_k)}
\qquad (1)
\]

Figure 5: Conceptual distance formula
where w_i are words and c_i are synsets representing those words. The
Conceptual Distance between two words depends on the length of the
shortest path that connects the concepts and on the specificity of the
concepts in the path. Given two words, the application of the
Conceptual Distance formula selects the closest concepts that
represent them.

Criterion   #Links  #Synsets  #Words  %OK  %KO  %HYPO  %HYPER  %NEAR
mono1        3,697     3,583   3,697   92    2      2       -      2
mono2          935       929     661   89    1      5       -      3
mono3        1,863     1,158   1,863   89    5      -       2      1
mono4        2,688     1,328   2,063   85    3      6       2      4
poly1        5,121     4,887   1,992   80   12      -       -      6
poly2        1,450     1,426     449   75   16      2       -      5
poly3       11,687     6,611   3,165   58   35      -       1      5
poly4       40,298     9,400   3,754   61   23      5       1      9
Variant      3,164     2,195   2,261   85    4      4       1      6
Field          510       379     421   78    9      2       2      9

Table 2: Results of class methods

Criterion   #Links  #Synsets  #Words  %OK  %KO  %HYPO  %HYPER  %NEAR
inters       1,256       966     767   79    4      8       -      9
parent       1,432     1,210     788   51    3     30       -     14
brother      2,202     1,645     672   57    5     22       -     16
distant      1,846     1,522     866   60    4     23       -     13

Table 3: Overall results for the Structural Criteria
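The Conceptual Distance formula (1) can be illustrated with a minimal
Python sketch over a toy hierarchy. Several points are assumptions made
for the example rather than details given in the text: the hierarchy is
a tree encoded as a child-to-parent map, the root has depth 1, and the
path includes both end concepts. All function names and concept
identifiers are hypothetical.

```python
from itertools import product


def ancestors(c, parent_of):
    """Return the chain [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def depth(c, parent_of):
    """Depth of a concept; the root has depth 1 (assumption)."""
    return len(ancestors(c, parent_of))


def path(c1, c2, parent_of):
    """Shortest path between two concepts through their lowest common ancestor."""
    up1, up2 = ancestors(c1, parent_of), ancestors(c2, parent_of)
    common = next((c for c in up1 if c in up2), None)
    if common is None:
        return None
    return up1[:up1.index(common) + 1] + up2[:up2.index(common)][::-1]


def concept_distance(c1, c2, parent_of):
    """Sum of 1/depth over the concepts on the path, as in formula (1)."""
    p = path(c1, c2, parent_of)
    if p is None:
        return float("inf")
    return sum(1.0 / depth(c, parent_of) for c in p)


def word_distance(w1, w2, synsets_of, parent_of):
    """Minimum distance over all synset pairs; also returns the chosen pair."""
    candidates = [
        (concept_distance(c1, c2, parent_of), c1, c2)
        for c1, c2 in product(synsets_of[w1], synsets_of[w2])
    ]
    return min(candidates)


# Toy hierarchy (hypothetical concepts).
parent_of = {"plant": "entity", "animal": "entity", "tree": "plant",
             "oak": "tree", "fir": "tree", "dog": "animal"}
synsets_of = {"w_oak": ["oak"], "w_ambiguous": ["fir", "dog"]}
best = word_distance("w_oak", "w_ambiguous", synsets_of, parent_of)
```

Because each concept contributes the inverse of its depth, paths that
climb towards the top of the hierarchy are penalised: in the toy data
the pair oak/fir is preferred over oak/dog, whose path passes through
the root.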

Following this approach, three different meth- 
ods have been applied. 



3.3.1 Using Co-occurrence words
collected from DGILE (CD1)



Following (Wilks et al. 93), two words are co-occurrent in a
dictionary if they appear in the same definition. For DGILE, a lexicon
of 300,062 co-occurrence pairs among 40,193 Spanish word forms was
derived, and the affinity between the words of each pair was measured
by means of the Association Ratio (AR), which can be used as a
fine-grained CS.

Then, the Conceptual Distance formula has been computed for all those
pairs using HBil and the nominal part of WN.
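The extraction of co-occurrence pairs can be sketched as follows. The
sketch only counts unordered word pairs that share a definition; the
Association Ratio itself is not reproduced because its formula is not
given in the text. The function name and the toy definitions are
hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_pairs(definitions):
    """Count unordered word pairs that appear in the same definition.

    definitions: iterable of token lists, one list per definition.
    """
    counts = Counter()
    for words in definitions:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


# Toy definitions (hypothetical, already tokenised and filtered).
definitions = [["planta", "hoja", "tallo"], ["hoja", "planta"]]
counts = cooccurrence_pairs(definitions)
```

Each pair would then be translated through HBil and scored with the
Conceptual Distance formula to choose the closest pair of synsets.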

3.3.2 Using Headword and genus of 
DGILE (CD2) 

The Conceptual Distance formula has been computed on the headword and
the genus term of 92,741 nominal definitions of the DGILE dictionary
(only 32,208 with translation to English).



3.3.3 Using Spanish entries with
multiple translations in the
bilingual dictionary (CD3)

In this case, we have derived a small but closely related lexicon of
3,117 translation equivalents with multiple translations from the
Spanish/English direction of the bilingual dictionary (only 2,542 with
connection to WordNet1.5).

Table 5 summarizes the performance of the three Conceptual Distance
methods.

4 Combining methods 

Collecting the synset assignments produced by the methods described
above with an accuracy greater than 85% (mono1, mono2, mono3, mono4,
variant, field), we obtain a preliminary version of the Spanish
WordNet containing 10,982 connections (1,777 polysemous) among 7,131
synsets and 8,396 Spanish nouns, with an overall CS of 87.4%. However,
by combining the discarded methods we can take advantage of those
portions of their results that are precise enough to be acceptable.

All files resulting from the discarded methods were crossed and their
intersections were calculated. Using VI, a manual inspection of
samples from each intersection was carried out. Results are shown in
Table 6.
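The crossing of discarded methods can be sketched in Python as set
intersections followed by a threshold test. The sketch assumes that
each method yields a set of (Spanish word, synset) links and that a
hand-validated sample is available as a mapping from link to
diagnostic; the function names, the toy links and the choice of
counting only "ok" diagnostics as correct are assumptions of the
example.

```python
from itertools import combinations


def pairwise_intersections(method_links):
    """Intersect the link sets of every pair of methods."""
    return {
        (a, b): method_links[a] & method_links[b]
        for a, b in combinations(sorted(method_links), 2)
    }


def accuracy(links, verdicts, accepted=("ok",)):
    """Percentage of validated links whose diagnostic is accepted."""
    judged = [verdicts[link] for link in links if link in verdicts]
    if not judged:
        return None  # nothing validated for this set
    return 100.0 * sum(v in accepted for v in judged) / len(judged)


def select_links(method_links, verdicts, threshold=85.0):
    """Keep the links of every intersection whose accuracy exceeds the threshold."""
    selected = set()
    for links in pairwise_intersections(method_links).values():
        acc = accuracy(links, verdicts)
        if acc is not None and acc > threshold:
            selected |= links
    return selected


# Toy data (hypothetical links and diagnostics).
method_links = {
    "cd1": {("a", "s1"), ("b", "s2"), ("c", "s3")},
    "cd2": {("a", "s1"), ("b", "s2")},
    "poly4": {("c", "s3"), ("b", "s9")},
}
verdicts = {("a", "s1"): "ok", ("b", "s2"): "ok", ("c", "s3"): "ko"}
selected = select_links(method_links, verdicts)
```

Links that only one method proposes never enter the selection, which
mirrors the idea that agreement between independent procedures raises
confidence.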

Intersections with a CS greater than 85% appear in bold. Up to 7,244
connections (2,075 polysemous) can be selected with 85.63% CS, 4,553
of which are new, with an overall CS of 84%, resulting in a 41%
increase. It must be pointed out that 1,308 of the new connections are
polysemous.

Criterion   #Links  #Synsets  #Words  %OK  %KO  %HYPO  %HYPER  %NEAR
CD-1        23,828    11,269   7,283   56   38      3       2      2
CD-2        24,739    12,709  10,300   61   35      -       -      3
CD-3         4,567     3,089   2,313   75   12      -       2      8

Table 5: Performance of Conceptual Distance methods

method1        method2:    cd1     cd2    cd3   dist   fath     p1    p2     p3      p4
bro      size              855     828    435    449    405     76   107      -   1,872
         %ok                70      71     79     58      6     86    89      -      67
cd1      size                -  15,736  1,849    576    419  2,076   556  3,146  15,105
         %ok                 -      79     85     68     71     86    86     72      64
cd2      size                -       -  2,401    571    428  2,536   592  3,777  13,246
         %ok                 -       -     86     71     72     88    86     75      67
cd3      size                -       -      -    391    325    205   180    215   3,114
         %ok                 -       -      -     79     80     95    95    100      77
dist     size                -       -      -      -  1,432     69    68      -   1,463
         %ok                 -       -      -      -     67     78     7      -      65
fath     size                -       -      -      -      -     69    61      -   1,101
         %ok                 -       -      -      -      -     77    70      -      67
p1       size                -       -      -      -      -      -     -     77     178
         %ok                 -       -      -      -      -      -     -    100      88
p2       size                -       -      -      -      -      -     -     28      78
         %ok                 -       -      -      -      -      -     -     77      96

Table 6: Results combining methods

A second version of the Spanish WordNet has then been obtained,
containing 15,535 connections (3,373 polysemous) among 10,786 synsets
and 9,986 Spanish nouns, with a final accuracy of 86.4%. Table 7 shows
the overall figures of the resulting SpWNs.
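The figures reported in this section can be cross-checked with a few
lines of arithmetic. The weighted average used below for the final
accuracy is an assumption about how the two CSs combine; it is not
stated in the text, although it reproduces the reported value.

```python
# Figures taken from the text of this section.
base_links, base_cs = 10_982, 87.4   # first version of the Spanish WordNet
new_links, new_cs = 4_553, 84.0      # new connections from the combinations

total_links = base_links + new_links
increase_pct = 100.0 * new_links / base_links

# Assumption: the final CS is the link-weighted average of both CSs.
combined_cs = (base_links * base_cs + new_links * new_cs) / total_links

print(total_links, round(increase_pct, 1), round(combined_cs, 1))
```

The total matches the 15,535 connections of the second version, the
relative increase rounds to the reported 41%, and the weighted average
rounds to the reported 86.4%.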

5 Conclusions 

An approach for building multilingual WordNets that combines a variety
of lexical sources as well as a variety of methods has been proposed.
It tries to take advantage of the existing WN1.5 for attaching words
from other languages in a way guided mainly by the content of
bilingual lexical sources.

A central issue of our approach is the combination of methods and
sources in such a way that the accuracy of the data obtained from the
combined methods exceeds the accuracy obtained from the individual
ones. Several families of methods have been tested, each of them
bearing its own CS. Only those methods offering a result over a
threshold (85%) have been considered.

In a second phase of our experiments, intersections between the
results provided by the different individual methods have been
computed. It is clear that valuable sets of entries which individually
have an insufficient, in some cases rather bad, CS can nevertheless be
extracted if they occur as a combination of several methods. In this
way, using the same threshold, the number of synsets attached to
Spanish entries has increased. It must be pointed out that some of
these new connections correspond to highly polysemous words.

The approach seems to be extremely promising, attaching up to 75% of
reachable Spanish nouns and 55% of reachable WN1.5 synsets. Currently
we are performing complementary experiments to extend the approach to
other lexical sources, especially wider-coverage bilinguals.

Other lines of research we are currently following include: 1) Dealing
with mismatches, i.e., cases in which different methods or sources
assign a Spanish word to different synsets. If the overall CS
increases when methods agree, it should decrease when they disagree.
2) A fine-grained cross-comparison of methods and sources
(intersections of more than two classes, decomposition of classes into
finer ones, etc.) will be performed to obtain a more precise
classification and CS assignment. 3) We are trying to obtain an
empirical method for CS calculation of intersections. Methods based on
Bayesian inference networks or quasi-probabilistic approaches have
been tested, giving promising results.

              #Links  #Synsets  #Words   %CS  #Poly Links
SpWN v.0.0    10,982     7,131   8,396  87.4        1,777
Combination    7,244     5,852   3,939  85.6        2,075
SpWN v.0.1    15,535    10,786   9,986  86.4        3,373

Table 7: Overall Figures of SpWNs